Take Home Exercise-3

Reveal the economy of the city of Engagement, Ohio, USA by using appropriate static and interactive statistical graphics methods for the VAST Challenge 2022.

Shachi Anirudha Raodeo https://github.com/ShachiR/ISSS608 (School of Computing and Information Systems)
2022-05-25

The Task

This take-home exercise aims to reveal the economy of the city of Engagement, Ohio, USA by using visualisation techniques in R.

The economic area considers the financial health of the city. Over time, are businesses growing or shrinking? How are people changing jobs? Are standards of living improving or declining over time?

Consider the financial status of Engagement’s businesses and residents, and use visual analytic techniques to address these questions.

Over the period covered by the dataset, which businesses appear to be more prosperous? Which appear to be struggling?

Links to the dataset:

Restaurants.csv, Pubs.csv and TravelJournal.csv

Step-by-step Data Visualisation

Installing and launching R packages

Several packages, including tidyverse, plotly, ggdist and treemap, are required for this exercise. The code chunk below installs any missing packages and loads them into the RStudio environment.

packages = c('tidyverse', 'ggplot2', 'ggdist', 'ggridges', 'patchwork', 'ggthemes',
             'hrbrthemes', 'ggrepel', 'ggforce', 'HH', 'vcd', 'scales', 'grid',
             'gridExtra', 'formattable', 'readr', 'ggiraph', 'plotly', 'DT',
             'gganimate', 'readxl', 'gifski', 'gapminder', 'treemap', 'treemapify',
             'rPackedBar', 'ggstatsplot', 'ggside', 'broom', 'crosstalk')

# Install any package that is not yet available, then load it
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

Data Preparation

Data Source

The datasets used in this exercise are Restaurants.csv, Pubs.csv and TravelJournal.csv, published by the IEEE for the [VAST Challenge 2022](https://vast-challenge.github.io/2022/).

Importing the dataset

The code chunk below imports Restaurants.csv, Pubs.csv and TravelJournal.csv from the data folder into R by using the read_csv() function of readr, and saves them as tibble data frames called restaurants, pubs and travel.

restaurants <- read_csv("data/Restaurants.csv")
summary(restaurants)
  restaurantId       foodCost      maxOccupancy      location        
 Min.   : 445.0   Min.   :4.070   Min.   : 48.00   Length:20         
 1st Qu.: 783.5   1st Qu.:4.395   1st Qu.: 71.75   Class :character  
 Median :1122.0   Median :5.095   Median : 85.00   Mode  :character  
 Mean   :1123.5   Mean   :5.035   Mean   : 84.70                     
 3rd Qu.:1462.0   3rd Qu.:5.678   3rd Qu.:103.25                     
 Max.   :1805.0   Max.   :5.920   Max.   :119.00                     
   buildingId   
 Min.   : 27.0  
 1st Qu.:151.0  
 Median :294.5  
 Mean   :432.9  
 3rd Qu.:754.2  
 Max.   :991.0  
pubs <- read_csv("data/Pubs.csv")
summary(pubs)
     pubId        hourlyCost      maxOccupancy     location        
 Min.   : 442   Min.   : 6.417   Min.   :60.00   Length:12         
 1st Qu.: 780   1st Qu.: 9.725   1st Qu.:64.00   Class :character  
 Median :1118   Median :11.035   Median :69.50   Mode  :character  
 Mean   :1120   Mean   :10.866   Mean   :71.83                     
 3rd Qu.:1458   3rd Qu.:12.379   3rd Qu.:77.50                     
 Max.   :1800   Max.   :14.840   Max.   :96.00                     
   buildingId    
 Min.   :  29.0  
 1st Qu.: 237.0  
 Median : 495.5  
 Mean   : 484.8  
 3rd Qu.: 595.5  
 Max.   :1012.0  
travel <- read_csv("data/TravelJournal.csv")
summary(travel)

Missing value check

Before we proceed with data visualisation, we check whether the chosen datasets contain any missing (NA) values using the code below.

apply(restaurants, 2, function(x) any(is.na(x)))
restaurantId     foodCost maxOccupancy     location   buildingId 
       FALSE        FALSE        FALSE        FALSE        FALSE 
apply(pubs, 2, function(x) any(is.na(x)))
       pubId   hourlyCost maxOccupancy     location   buildingId 
       FALSE        FALSE        FALSE        FALSE        FALSE 
apply(travel, 2, function(x) any(is.na(x)))

Since no missing values are observed, we proceed with our next step.
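As an equivalent and slightly more informative check, colSums() over is.na() counts the missing values in each column instead of only flagging their presence. A minimal sketch, using the same data frames as above:

```r
# Count (rather than just flag) missing values per column
colSums(is.na(restaurants))
colSums(is.na(pubs))
colSums(is.na(travel))
```

A column with count 0 corresponds to FALSE in the apply() check above.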

Data Wrangling

The Travel Journal records financial transactions by participants for the purposes Work/Home Commute, Eating, Coming Back From Restaurant, Recreation (Social Gathering) and Going Back to Home. We retain only the records whose purpose is Eating or Recreation (Social Gathering).

travel_filt <- travel[travel$purpose %in% c("Eating","Recreation (Social Gathering)"),]
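The same subset can be expressed with dplyr's filter(), which reads more naturally in a tidyverse workflow. An equivalent sketch:

```r
travel_filt <- travel %>%
  filter(purpose %in% c("Eating", "Recreation (Social Gathering)"))
```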

Calculating Amount Spent

Calculating the total amount spent at a location as the difference between the starting balance and the ending balance in the travel journal.

travel_filt$amountSpent <- travel_filt$startingBalance - travel_filt$endingBalance
saveRDS(travel_filt, 'data/travel_filt.rds')
travel_filt <- readRDS('data/travel_filt.rds')
head(travel_filt)
# A tibble: 6 × 11
  participantId travelStartTime     travelStartLocationId
          <dbl> <dttm>                              <dbl>
1            23 2022-03-01 05:20:00                   532
2           876 2022-03-01 05:50:00                    NA
3           902 2022-03-01 06:05:00                    NA
4           919 2022-03-01 06:00:00                    NA
5           154 2022-03-01 05:55:00                    NA
6           509 2022-03-01 06:00:00                    NA
# … with 8 more variables: travelEndTime <dttm>,
#   travelEndLocationId <dbl>, purpose <chr>, checkInTime <dttm>,
#   checkOutTime <dttm>, startingBalance <dbl>, endingBalance <dbl>,
#   amountSpent <dbl>

Data Visualization

Grouping the data by travelEndLocationId, which corresponds to either a restaurant ID or a pub ID.

travel_group <- travel_filt %>%
  group_by(travelEndLocationId) %>%
  summarise(amountSpent = sum(amountSpent),
            .groups = 'drop') %>%
  arrange(desc(amountSpent))

data_merge <- merge(x = travel_group, y = restaurants,
                    by.x = 'travelEndLocationId', by.y = 'restaurantId')
data_merge$travelEndLocationId <- as.character(data_merge$travelEndLocationId)
data_merge$amountSpent <- data_merge$amountSpent / 1000   # revenue in thousands
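The base-R merge() above keeps only the locations that match a restaurant ID. The same inner join, together with the character conversion and scaling, can be written as one dplyr pipeline. An equivalent sketch:

```r
data_merge <- travel_group %>%
  inner_join(restaurants,
             by = c("travelEndLocationId" = "restaurantId")) %>%
  mutate(travelEndLocationId = as.character(travelEndLocationId),
         amountSpent = amountSpent / 1000)  # revenue in thousands of dollars
```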

The basic bar chart below has plain X and Y axes and is not very visually appealing. Considering the principles of data visualisation, it is improved in the interactive versions that follow, where the most and least prosperous businesses are highlighted in colour to make the differences easily visible to the reader.

ggplot(data = data_merge,aes(x = travelEndLocationId, y=amountSpent)) +
 geom_col()+
  xlab("Business ID") +
  ylab("Revenue (in thousands $)") 

library(plotly)

# Highlight selected restaurants in red, the rest in grey
color <- rep('rgba(204,204,204,1)', 20)
color[c(1, 12, 13, 14, 15)] <- 'rgba(222,45,38,0.8)'

fig<- plot_ly(data_merge, x = ~reorder(`travelEndLocationId`, -amountSpent), y = ~amountSpent, type = 'bar', marker = list(color = color))

fig <- fig %>% layout(title = "Revenue for Restaurants",
         xaxis = list(title = "Restaurant ID"),
         yaxis = list(title = "Revenue (in thousands $)"))

fig
data_pub <- merge(x = travel_group, y = pubs,
                  by.x = "travelEndLocationId", by.y = "pubId")
data_pub$travelEndLocationId <- as.character(data_pub$travelEndLocationId)
data_pub$amountSpent <- data_pub$amountSpent / 1000   # revenue in thousands
library(plotly)
# Highlight the top three pubs in red, the rest in grey
color <- rep('rgba(204,204,204,1)', 12)
color[1:3] <- 'rgba(222,45,38,0.8)'
fig<- plot_ly(data_pub, x = ~reorder(`travelEndLocationId`, -amountSpent), y = ~amountSpent, type = 'bar', marker = list(color = color))

fig <- fig %>% layout(title = "Revenue for Pubs",
         xaxis = list(title = "Pub ID"),
         yaxis = list(title = "Revenue (in thousands $)"))

fig
treemap(data_merge,
        index=c("travelEndLocationId"), 
        vSize="amountSpent",
        vColor="amountSpent",
        title="Amount Spent in thousands of Dollars - Restaurants",
        title.legend = "Amount Spent in thousands of Dollars - Restaurants"
        )

library(ggdist)
travel_filt %>%
  ggplot(aes(x = purpose, 
             y = amountSpent)) +
  stat_gradientinterval(
    fill = "skyblue",
    show.legend = TRUE
  ) +
   theme_bw()+
  scale_y_continuous(breaks = seq(0, 35, by = 5), 
                     limits = c(0, 35))+
  labs(
    y = 'Revenue',
    x = 'Purpose',
    title = "Revenue generated by Purpose",
    subtitle = "Gradient + interval plot",
    caption = "Economic information, Ohio USA"
  ) +
  theme(
    axis.title.y = element_text(angle = 0),
    axis.ticks.x = element_blank(),
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 1)
  )